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Abstract 

Background: Scientists routinely scan DNA sequences for transcription factor (TF) binding sites (TFBSs). Most of the 
available tools rely on position-specific scoring matrices (PSSMs) constructed from aligned binding sites. Because of 
the resolutions of assays used to obtain TFBSs, databases such asTRANSFAC, ORegAnno and PAZAR store unaligned 
variable-length DNA segments containing binding sites of a TF. These DNA segments need to be aligned to build a 
PSSM. While theTRANSFAC database provides scoring matrices for TFs, nearly 78% of theTFs in the public release do 
not have matrices available. As work on TFBS alignment algorithms has been limited, it is highly desirable to have an 
alignment algorithm tailored to TFBSs. 

Results: We designed a novel algorithm named LASAGNA, which is aware of the lengths of input TFBSs and utilizes 
position dependence. Results on 1 89 TFs of 5 species in the TRANSFAC database showed that our method 
significantly outperformed ClustalW2 and MEME. We further compared a PSSM method dependent on LASAGNA to 
an alignment-free TFBS search method. Results on 89 TFs whose binding sites can be located in genomes showed 
that our method is significantly more precise at fixed recall rates. Finally, we described LASAGNA-ChIP, a more 
sophisticated version for ChIP (Chromatin immunoprecipitation) experiments. Under the one-per-sequence model, it 
showed comparable performance with MEME in discovering motifs in ChlP-seq peak sequences. 

Conclusions: We conclude that the LASAGNA algorithm is simple and effective in aligning variable-length binding 
sites. It has been integrated into a user-friendly webtool for TFBS search and visualization called LASAGNA-Search.The 
tool currently stores precomputed PSSM models for 1 89 TFs and 1 33 TFs built from TFBSs in the TRANSFAC Public 
database (release 7.0) and the ORegAnno database (08Nov1 0 dump), respectively. The webtool is available at http:// 
biogrid.engr.uconn.edu/lasagna_search/. 



Background 

A transcription factor is a protein that regulates the 
expression of its target genes by physically binding to the 
promoter regions of these genes. The binding sites of a 
transcription factor (TF) naturally share similarity with 
each other. The common pattern shared among the bind- 
ing sites of a TF is called a motif. In general, there are 
two approaches to computational motif analysis, de novo 
motif discovery [1-12] and transcription factor binding 
site (TFBS) search [13-17]. As the name suggests, de novo 
motif discovery algorithms find over-represented patterns 
in sequences without prior knowledge of the binding TFs. 
The input to these algorithms is usually the upstream 
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region sequences of genes putatively co-regulated by one 
or more common TFs. The output is one or more motifs 
or patterns whose instances are over-represented in the 
input sequences. On the other hand, a TFBS search algo- 
rithm takes binding site sequences of a TF as input. It 
learns from these known binding sites and builds a TF 
model out of them. The TF model can then be used to 
scan sequences for putative binding sites. While the two 
approaches are tightly connected, we focus on the TFBS 
search problem and assume that a TF has known binding 
sites available. 

A typical TFBS search algorithm requires aligned 
TFBSs. This requirement allows simple representations 
of TF models. Types of TF models include consensus 
sequences, position-specific scoring matrices (PSSMs) 
[18], etc. The PSSM method is a widely used method 
among the available TFBS search algorithms. Given 
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aligned binding sites of a TF, the TF model is essen- 
tially a 4 x / matrix, where / is the length of the binding 
sites. Column i of the matrix stores the scores of match- 
ing the 2 th letter in a sequence of length / (an /-mer) to 
nucleotides A, C, G and T, respectively. The score of an l- 
mer is then calculated by summing up the scores of letter 
1 through letter /. Depending on the variant of PSSM, the 
score of A at position i can be the count of A at position 
i in the known TFBSs, the log-transformed probability of 
observing A at position i, or any other reasonable num- 
ber. Once constructed, the matrix of a TF can be stored 
in a database to scan sequences for binding sites of the TF 
in the future without resorting to the actual binding sites. 
In fact, many tools [14,19-24] depend on matrices stored 
in at least one of the databases, JASPAR [25], RegulonDB 
[26] and TRANSFAC [27]. Since a matrix is constructed 
from aligned binding sites, we cannot overemphasize the 
quality of TFBS alignments. 

Databases such as JASPAR, TRANSFAC and ORe- 
gAnno [28] contain DNA segments bound by TFs. These 
DNA segments can be seen as TFBSs with some irrele- 
vant bases on one or both sides because of the resolutions 
of techniques used to obtain TFBSs. The DNA segments 
belonging to a TF are therefore unaligned variable-length 
sequences. While the DNA segments for most TFs in the 
JASPAR database are aligned, this is not the case for the 
TRANSFAC public and ORegAnno databases. About 53% 
(983 out of 1867) of the TFs in the TRANSFAC Pub- 
lic database (release 7.0) have unaligned variable-length 
DNA segments. Moreover, nearly 78% (1447 out of 1867) 
of TFs having curated DNA segments do not have a 
matrix. Focusing on TFs with variable-length DNA seg- 
ments, about 71% (669 out of 983) of them do not have a 
matrix. On the other hand, the ORegAnno database stores 
experimentally validated DNA segments bound by TFs 
but does not provide matrices. About 31% (175 out of 572) 
of the TFs therein have variable-length DNA segments. In 
the absence of a matrix, to search for binding sites of these 
TFs using a matrix dependent tool, one needs to first align 
the curated DNA segments for each TF. In the rest of this 
paper, we refer to (variable-length) DNA segments con- 
taining TFBSs as (variable-length) TFBSs for simplicity 
reasons. 

In this work, we propose a novel TFBS alignment algo- 
rithm named LASAGNA (Length- Aware Site Alignment 
Guided by Nucleotide Association). The algorithm is 
based on the hypothesis that binding sites of a TF share 
a core [29], a short and highly conserved stretch of DNA. 
Hence, a binding site can be seen as a core with some irrel- 
evant bases on one or both sides. In general, shorter sites 
tend to contain fewer irrelevant bases and are easier to 
align. For this reason, we progressively align the binding 
sites from the shortest to the longest ones. The algo- 
rithm further exploits dependence between two positions 



in a binding site. Dependence between positions has been 
shown to boost performance of TFBS search algorithms 
[13,16] as well as protein structural motif recognition [30]. 
To our best knowledge, this idea has never been applied 
to multiple sequence alignment. We further describe 
a more sophisticated version, named LASAGNA-ChIP, 
for aligning peak sequences produced by ChlP-seq 
experiments. 

To compare algorithms for TFBS alignment, we con- 
duct cross-validation (CV) experiments on 4771 binding 
sites of 189 TFs across 5 species extracted from the 
TRANSFAC Public database (release 7.0). We compare 
LASAGNA to ClustalW2 [31,32] and MEME [1]. Being 
a widely used multiple sequence alignment algorithm, 
ClustalW2 was used to produce gapped TFBS alignments 
in creating the MAPPER database [33] as well as to pro- 
duce both gapped and gapless TFBS alignments in [16]. 
ClustalW2 and other similar algorithms focus on produc- 
ing structurally correct alignments, while other improved 
algorithms rely on structural or homology information 
[34]. ClustalW2 can be viewed as a representative of these 
algorithms when no information other than sequences is 
available. MEME, on the other hand, is a de novo motif 
discovery tool, whose input is typically regulatory regions 
of length 1,000 bp upstream of the genes presumably con- 
trolled by a common TF [35]. Nevertheless, a motif found 
in the input TFBSs can be used to align the TFBSs. In 
fact, MEME is employed by the PAZAR database [36] to 
dynamically align TFBSs and generate PSSMs. We show 
that LASAGNA significantly outperforms ClustalW2 (p- 
value: 1.22 x MT 15 ) and MEME (p-value: 3.55 x 10" 15 ). 

To scan promoters for new TFBSs based on variable- 
length known TFBSs, we couple a PSSM method with 
LASAGNA, denoted by LASAGNA-PSSM. That is, the 
input variable-length TFBSs are aligned by LASAGNA 
and a PSSM model is built from the alignment. It 
is useful to compare an alignment-based TFBS search 
method to an alignment-free method. Therefore, we fur- 
ther compare LASAGNA-PSSM to SiTaR [17], which 
accepts variable-length input TFBSs. To our best knowl- 
edge, SiTaR is the only alignment-free method capable 
of handling variable-length input TFBSs at the time of 
writing. Cross-validation results on 90 TFs whose bind- 
ing sites can be located in respective genomes indi- 
cate that LASAGNA-PSSM is significantly more precise 
at fixed recall rates (p-value: 2.66 x 10 -8 ). The recall- 
precision curve also shows that our method is constantly 
more precise at any recall rate and more sensitive at any 
precision. 

Finally, we demonstrate the application of LASAGNA - 
ChlP to ChlP-seq data using 38 mouse ChlP-seq exper- 
iments. We show that, assuming the one-per-sequence 
model, LASAGNA-ChIP is comparable to MEME in 
revealing the motif of the ChlPed TF or its cofactor. For 
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both LASAGNA-ChIP and MEME, the ChlPed TF motif 
was found in 31 experiments, while a cofactor motif was 
found in 3 experiments. While the two methods differ 
in the rest 4 experiments, the found motifs have sim- 
ilar information content and may belong to unknown 
cofactors. 

Methods 

We describe our novel alignment algorithm in this section. 
LASAGNA utilizes a search module to align a new bind- 
ing site with a partial alignment. Thus, we introduce the 
search module followed by the LASAGNA algorithm. 

The search module 

The search module of LASAGNA is a function learned 
from a (partial) TFBS alignment to score /-mers. It consid- 
ers nucleotide pairs in addition to individual nucleotides 
so as to exploit dependence between positions. We intro- 
duce our choice of the search module, the PSSM model 
described in [13]. We denote it by PSSM^(-) in this work. 

Suppose that a PSSM is constructed from aligned 
sequences of length /. The score of letter u at position i is 
given by 



Mi(u) = log 



fi(u) 
/(«)' 



where fi(u) is the probability of observing letter u at posi- 
tion i and f(u) is the background probability of seeing 
letter u. Similarly, the score of a pair of letters (u, v) at 
position is given by 

fa(u, v) 

M uj {u,v) = \og J ^---, 
f{u, v) 

where fi,j(u, v) is the probability of observing nucleotide 
pair (u, v) at position («',/) and f(u, v) is the background 
probability of seeing the pair. The score of s, a sequence of 
length /, is then 

/ K l-k 

PSSM^(s) = J2 Mi(si) + M tM> */). (!) 

/=1 k=l i=l 

where s; denotes the z' th letter of s, j = i + k and K is the 
scope parameter defined in [13]. The parameter K con- 
trols how far apart a pair of nucleotides can be. When 
K = 1, only adjacent nucleotide pairs are scored. We 
define PSSMn(s) = £!=iAf<0;), that is, we do not score 
nucleotide pairs when K = 0. 
Our search module is a variant of (1). Let 



M\{u) 



min* Mi (x) if u is the gap letter 
Mi(u) otherwise 



and 



, _ \mm Xi yMij(x, y) if u or v is the gap letter 
' ~ \Mij(u, v) otherwise 



The search module is defined as follows: 

/ K l-k 

PSSM^(s) = J2 M ife) + M 'iM' S A ( 2 ) 

(=1 k=l i=l 

where superscript a denotes alignment as this module is 
used in our alignment algorithm. 

The LASAGNA algorithm 

The algorithm is based on the idea that the binding sites 
of a TF share a common core, a conserved short DNA 
sequence. A binding site can then be seen as a core with 
a few irrelevant bases on one or both sides. Assuming 
that each binding site fully contains the core, the shorter a 
binding site, the fewer irrelevant bases it contains. There- 
fore, we progressively align the binding sites by aligning 
the shortest binding site with the already aligned binding 
sites until all the binding sites are aligned. 

The algorithm takes a set of unaligned binding sites, U, 
and parameter K & as inputs. Let A denote the set of aligned 
binding sites. A binding site in A may have gap letters 
added to one or both ends as a result of the alignment. The 
algorithm works as follows: 

1. Initialize A to {s}, where s, the seed site, a shortest 
binding site arbitrarily chosen from U. Remove s 
from U. 

2. (a) Build PSSM^(-) from A. Let the length of 

this PSSM be'l 

(b) Remove the shortest binding site s from U. 

(c) Create S, the augmented sequence of s, by 
adding / — 1 gap letters to both ends of s. 

(d) Score each i-mer of S by PSSM^(-) to find 
the highest scoring one. 

(e) Let s be its reverse-complement and repeat 
c-d. That is, the opposite strand is 
considered. 

3. Add s to A if the highest scoring i-mer resides in s. 
Otherwise, add its reverse-complement to A. Gap 
letters are added to one or both ends of sequences in 
A. This ensures that they are all of the same length 
and each column of the alignment has at least one 
non-gap letter. 

4. Repeat 2-3 until U is empty. 

In step 2b, there may be more than one shortest binding 
sites in U. To break the tie, we use PSSM^ (•) to scan each 
of the shortest ones. The "s" containing the highest scoring 
l-mer is removed from U to align with sequences in A In 
the unlikely case of two or more shortest binding sites in U 
sharing the same highest score, one is arbitrarily chosen. 
Figure 1 illustrates an iteration of the algorithm. 

An alignment may be trimmed before building a PSSM. 
We describe one way of trimming aligned TF binding sites 
using two simple measures. Let / be the length of the 
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(a) 



-GCGCTAA- - 

- - CGCCAAA- 
-GCGCCAAA- 
-GCGCCAAA- 
-GCGCCAAA- 
-GCGCGAAA- 

- GCGCCAAT - 
-CCGCCAAA- 
-CCGCGAAA- 

- - CGCGGAAA 
-GCGCGAAG- 
-GCGGGAAA- 
-GCGCGATC- 
-CCCGGAAA- 
CGCGCCAAA- 
- GCGCGAAAA 
CCCGCCAGG- 



(c) 



(b) 



PSSM 




GCGCTAA- - 

CGCCAAA- 

GCGCCAAA- 

GCGCCAAA- 

GCGCCAAA- 

GCGCGAAA- 

GCGCCAAT - 

CCGCCAAA- 

CCGCGAAA- 

CGCGGAAA 

GCGCGAAG- 

GCGGGAAA- 

GCGCGATC- 

CCCGGAAA- 

CGCGCCAAA- 

GCGCGAAAA 

CCCGCCAGG- 

TTTCCCGCCAA- - 



TTTCCCGCCAA - 



U 



TTTCCCGCCAA 

TTTCGCGCCAAA 
TTTGGCGGGCGGCC 
CAATTTTCGCGCGG 
CCATTTTCGCGGGAA 



u 



TTTCGCGCCAAA 
TTTGGCGGGCGGCC 
CAATTTTCGCGCGG 
CCATTTTCGCGGGAA 



Figure 1 An illustration of LASAGNA with = 0. (a) The aligned binding sites in A and the unaligned ones in U. The shortest binding site is in 

bold. (b)The sequence logo [48] of the PSSM built from A aligns with the augmented sequence TTTCCCGCCAA , where the 

matched portion is in bold, (c) The updated A and U, where the newly added binding site is in bold. 



aligned binding sites. We first compute and denote the 
percentage of non-gap letters at position i of the align- 
ment by Ci, for i = 1, 2, . . . , /. The information content 
(IC) at each position is then computed with small sample 
correction described in [37]. That is, 



max 



0, 2+ J2 Mu)log 2 fi(u) 

ue{A, C,G,T} 



where i e {1, 2, ... , I}, ni is the number of non-gap let- 
ters at position i and e(-) gives the approximated sampling 
error. Let C m i n and /Cmin be the cutoff thresholds. The 
alignment is examined from the left end to the right 
until the first position j satisfying both Cy > C m in and 
ICj > /Cmin is encountered. The positions preceding / are 
trimmed off. The trimming is similarly applied to the right 
end. 

LASAGNA for ChlP-seq data 

Although LASAGNA is not specifically designed as a 
de novo motif discovery algorithm, a more sophisticated 
version, named LASAGNA-ChIP, is capable of handling 
ChlP-seq data. Here, we refer to the previous section 
and describe the additional steps that are necessary for 
aligning ChlP-seq peak sequences. The flowchart in Addi- 
tional file 1 gives an overview of LASAGNA-ChIP 

Before aligning ChlP-seq peak sequences, each 
sequence is clipped to 100 bp surrounding the signal peak. 
This is a common practice since, for most peak sequences 
(> 90%), the actual TFBS is usually found within 50 bp of 



the called peak [38]. In step 2a, we trim the partial align- 
ment A if it contains more than two sequences. Unlike 
TFBSs found in databases such as TRANSFAC, even after 
clipping, a peak sequence contains much more irrele- 
vant bases flanking the core. The trimming procedure 
described in the previous section is used, where C m i n 
(7C m in) is set to the mean C (/Q) over all the columns of 
A. The resulting alignment is further trimmed by IC such 
that it has at most 15 columns and the columns on both 
ends have positive IC. In step 2b, if there are more than 5 
shortest binding sites in U. Only 5 are arbitrarily chosen 
to break the tie by similarity to PSSM^- (■)• 

The alignment A obtained by the modified procedure is 
further refined as follows: 

1. Set T to A trimmed to 1 columns as described above. 

2. Build PSSM^(-) out of T. 

3. Initialize R to {}, the refined partial alignment. 

4. For each peak sequence s, 

(a) Create S, the augmented sequence of s, by 
adding I — 1 gap letters to both ends of s. 

(b) Score each i-mer of S by PSSM^(-) to find 
the highest scoring one. 

(c) Let s be its reverse-complement and repeat 
a-b. 

(d) Add s to R if the highest scoring 1-mei 
resides in s. Otherwise, add its 
reverse-complement to R. Gap letters are 
added to one or both ends of sequences in R. 



Lee and Huang BMC Bioinformatics 201 3, 14:1 08 
http://www.biomedcentral.eom/1 471 -21 05/1 4/1 08 



Page 5 of 13 



5. Set A to R and repeat 1-5 until the sum of IC across 
columns of T does not change in 3 iterations. 

For ChlP-seq peak sequences, the shortest sequence 
may miss or contain only a fraction of the core. Hence, 
using the shortest sequence as the seed site sometimes 
results in an alignment with less IC. For this reason, five 
additional sequences are arbitrarily chosen as the seed site 
to produce 5 additional alignments. Among the 6 align- 
ments, the one with the most IC after trimming is chosen 
as the final alignment. 

Scoring a putative binding site 

Although a PSSM suggests the length of a putative binding 
site, we do not restrict the length of a candidate binding 
site to the length of the PSSM. A putative binding site 
could be of any reasonable length. If a true binding site 
is flanked by a few irrelevant bases, this sequence should 
be given a relatively high score compared to those of non- 
binding sites. Therefore, to score a putative binding site s, 
we slide s through the PSSM as described in the previous 
section. The score of sequence s is given by 

Score/c s (s) = max PSSM^ s (5 !:(i+ /_i)), (3) 

i€jl,2,...,/+/ s -l) 

where / is the length of the PSSM, l s is the length of s, S 
denotes the augmented sequence of s with / — 1 gap letters 
on both ends and PSSMj^O is defined in (1). 

Results and discussion 

Comparison of alignment algorithms 
Data sets 

We downloaded all the TF binding sites from the TRANS - 
FAC Public database (release 7.0). The binding sites were 
grouped by species and TF. Binding sites having less than 
4 nucleotides were discarded. TFs of each species were fil- 
tered such that each TF has at least 10 binding sites. This 
ensures that each TF has enough binding sites to con- 
struct a PSSM. The numbers of TFs and TFBSs are listed 
in Table 1. 

To facilitate experiments, we planted each TFBS in a 
2000 base random sequence simulated by a first-order 



Table 1 TFBSs in TRANSFAC public database by species 



Species 


tfTFs 1 


# TFBSs 2 


Homo sapiens 


68 


1984 


Mus musculus 


53 


966 


Rattus norvegicus 


26 


633 


Drosophila melanogaster 


29 


935 


Saccharomyces cerevisiae 


13 


253 


Overall 


189 


4771 



1. The total number of TFs. 

2. The total number of TFBSs. 



Markov chain of the species in question. Except for Sac- 
charomyces cerevisiae, the Markov chain of a species was 
learned from promoter sequences in the UCSC Genome 
Browser database [39]. For Saccharomyces cerevisiae, the 
promoter sequences were retrieved from the SCPD [40] 
using the yeast gene list in euGenes [41]. 

Performance assessment and evaluation metrics 

Since the purpose of aligning TFBSs is to construct a 
PSSM, the quality of an alignment is best measured by the 
search performance of the PSSM. The performance of a 
TFBS search method is evaluated by v-fold CV. Consider 
a TF with n binding sites. The n TFBSs are first divided 
into v sets, each of which contains L^J or [^J + 1 TFBSs. 
At each iteration of the v-fold CV, one of the v TFBS sets 
called the test TFBS set, Ptesti is left out. The rest of the 
TFBSs are aligned to build a PSSM. Each test TFBS in 
-Ptest is then planted in a 2000 base random sequence and 
scanned by the PSSM, scoring each l-mer, where / is the 
length of the test TFBS. We score both the forward and 
reverse strands of an l-mer and assign the higher score to 
it. An l-mer is considered a hit if it shares more than L//2J 
bases with the test TFBS. The /-mers can then be divided 
into two sets, H and N, where H is the set of hits and N 
is considered the set of non-binding sites. The score of 
the test TFBS is the highest score of hits in H. For each 
test TFBS t e Ptest. we find its rank relative to all the 
non-binding sites in N. Formally, the rank of binding site t 
equals 1 + \{s € AT | Scorers) > Scorejf s (£)}|. 

After the v-fold CV, we end up with n ranks, each of 
which corresponds to a TFBS. We use the area under the 
ROC curve (AUC) to gauge the quality of alignment. The 
ROC curve is a plot of true positive rate (TPR) against 
false positive rate (FPR), displaying the trade-off between 
TPR and FPR. We refer readers to [42] for an introduc- 
tion to this metric. In this study, v = 10 for all the CV 
experiments. 

Comparison with ClustalW2 

In general, gapless alignment is preferred over gapped 
alignment for aligning TFBSs. Because of the nature of 
ClustalW2, the alignment of TFBSs may contain gaps in 
the middle of some binding sites. This is disadvantageous 
to ClustalW2 as the PSSM method does not allow inser- 
tion of gaps into the sequence being scanned. Hence, 
we turned off gaps by setting the gap opening penalty 
parameters to a large value, i.e., we set both GAPOPEN 
and PWGAPOPEN to 100000. Indeed, results indicated 
that overall the "gapped" ClustalW2 performs slightly 
worse than the "gapless" variant (p-value: 0.277). For both 
LASAGNA and ClustalW2, parameter K s in Eq. 3 was 
searched from 0 to min{10, l m i n — 1} for each TF and the 
one producing the highest AUC is used, where l m i n is the 
minimal length of the TFBSs. For LASAGNA, parameter 
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7<a of the LASAGNA algorithm was set to K s as the two 
parameters are closely related. 

We conducted 10-fold CV on each TF. The overall ROC 
curves are shown in Figure 2. The ROC curves are based 
on the ranks of 4771 TFBSs of 189 TFs. It shows that 
LASAGNA has invariably higher true positive rate than 
ClustalW2. The AUC score was calculated for each TF and 
for each method. To gauge the significance of difference, 
the Wilcoxon signed-rank test [43] was performed for 
each species. The tests showed that LASAGNA is consis- 
tently better than ClustalW2 across the 5 species. Table 2 
shows the test results. Overall, LASAGNA performed sig- 
nificantly better than ClustalW2 in terms of AUC scores. 
The species-wise jj-values shows that LASAGNA is signif- 
icantly better (< 0.05) than ClustalW2 for aligning TFBSs 
of all the 5 individual species. 

To better understand the results, we split the 189 TFs 
into two groups. One contains TFs on which LASAGNA 
performed better than ClustalW2 and the other contains 
the rest of the TFs. Three factors are examined for each 
TF. They are the number of TFBSs, the mean and stan- 
dard deviation of TFBS length. For each factor, we looked 
for difference between the two groups. Table 3 shows the 
comparisons. It can be seen that LASAGNA produces bet- 
ter alignments when a TF has fewer binding sites but the 
difference is not significant. The mean and standard devi- 
ation of TFBS length are the two more important factors. 



We believe that LASAGNA is well-suited for aligning 
TFBSs that are longer and more variable in length. 

Comparison with MEME 

The MEME tool in the MEME Suite 4.8.1 was used. The 
parameter minw, minimal width of motifs, was set to the 
smaller of 6 and the minimal length of input TFBSs. The 
option revcomp to search the reverse strand was turned 
on. Finally, the parameter minsites was set to the num- 
ber of input TFBSs since a common motif is supposed to 
appear at least once in each TFBS. To ensure that MEME 
functions properly, binding sites shorter than 8 bases are 
padded with gap letters since genomic locations are not 
available for most TFBSs. 

The experiments were carried out in the same man- 
ner as the ClustalW2 experiments. The overall ROC 
curve in Figure 2 indicates that LASAGNA has consis- 
tently higher true positive rates than MEME across dif- 
ferent false positive rates. The overall and species-wise 
comparisons between LASAGNA and MEME in Table 4 
show that LASAGNA performed significantly better than 
MEME. To gain some insights into the difference between 
LASAGNA and MEME, we similarly examined the three 
factors used to compare LASAGNA and ClustalW2. As 
seen in Table 5, the number of input TFBSs is the only 
significant (p-vahie < 0.05) factor out of the three. The 
reasons are not clear but may be investigated in the 
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LO 
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O 
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CM 
O 



o 
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Figure 2 Overall ROC curves for the three alignment algorithms. The left panel shows the curves at low false positive rates, from 0 to 0.02. The 
right panel presents the curves at false positive rates from 0.02 to 0.6. The three methods are indistinguishable when the false positive rate is greater 
than 0.6 and hence the region is not shown. We note that the vertical axes of the two panels are on different scales. 
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Table 2 Species-wise and overall comparisons between LASAGNA and ClustalW2 


Species 


# better 1 


# ties 2 


#TFs 3 


p-value 4 




H. sapiens 


54 (79.4%) 


0 


68 


4.42 x 1 0" 


-7 


M. musculus 


42 (79.2%) 


0 


53 


1.41 x 10" 


-5 


D. melanogaster 


22 (75.9%) 


0 


29 


9.89 x 10" 


-4 


S. cerevisiae 


9 (69.2%) 


1 


13 


3.88 x 10" 


-2 


R. norvegicus 


20 (76.9%) 


1 


26 


1.54 x 10" 


-3 


Overal 


147 (77.8%) 


2 


189 


1.22 x 10" 


■15 



1. Number of TFs on which LASAGNA performs better than ClustalW2. 

2. Number of TFs on which LASAGNA and ClustalW2 have the same performance. 

3. Total number of TFs for a species. 

4. Wilcoxon signed-rank test p-value. 



future. Moreover, it will be helpful to identify other (bio- 
logically meaningful) factors that can better explain the 
performance difference. 

Distribution ofK s 

In Additional file 2, for LASAGNA, ClustalW2 and 
MEME, we show the distribution of K s for a TF by 
species and conserved domain. Overall, we observe that 
small values are preferred for all three methods. By visual 
inspection, LASAGNA appears more similar to MEME 
than ClustalW2 in the usage of K s . It can be seen that the 
usage of Ks differs among different conserved domains. 
Related conserved domains such as ZF-H2C2_2 and ZF- 
C2H2 display similar patterns. This is not surprising as 
conserved domains in a protein are often computation- 
ally predicted. Hence, a protein is likely to possess related 
conserved domains. While overall the distributions seem 
method-dependent, we observe that, for ZF-H2C2_2 and 
ZF-C2H2, the distributions center around 4 across all 
three methods. Finally, we note that these observations are 
preliminary and more TFs are needed to draw statistically 
sound conclusions. 

Comparison of TFBS search methods 
Data sets 

To compare with an alignment-free TFBS search method, 
SiTaR, [17], we retrieved real promoter sequences embed- 
ding TFBSs. Specifically, we followed the curated location 



Table 3 Comparison of two groups of TFs divided 
according to results on LASAGNA and ClustalW2 



Factor 


Group 1 1 mean 


Group 2 2 mean 


p-value 3 


# TFBSs 4 


25.07483 


25.83333 


0.1409 


Mean of TFBS length 


18.78626 


17.56167 


0.08451 


SD of TFBS length 5 


8.180204 


6.921905 


0.06295 



1. LASAGNA performed better than ClustalW2 on TFs in this group. 

2. ClustalW2 performed better than or equal to LASAGNA on TFs in this group. 

3. Wilcoxon signed-rank test p-value. 

4. Number of binding sites for each TF. 

5. Standard deviation of binding site length for each TF. 



of each binding site in the TRANSFAC Public database 
(release 7.0) to retrieve the 1000-base sequences flanking 
the binding site. We discarded binding sites that can- 
not be found in the proximity of the curated locations. 
The retrieved binding sites were grouped by TF and TFs 
having less than 10 binding sites were removed. After fil- 
tering, we ended up with 90 TFs and 1751 binding sites. A 
TF may be present in more than one species as homologs 
and hence the binding sites of a TF may be located in 
genomes of multiple species. The species and respective 
numbers of binding sites are shown in Table 6. 

Performance assessment and evaluation metrics 

To compare with SiTaR [17], we adopt the same v-fold CV 
process used to compare LASAGNA with ClustalW2 and 
MEME. However, we do not assume that a TFBS search 
method scores all the /-mers in a promoter sequence, 
where / is the length of binding sites. Instead, a TFBS 
search method scans a promoter sequence and predicts a 
list of binding sites with respective scores. The predicted 
binding sites may be of different lengths, which is the case 
for SiTaR. 

We describe how a hit is determined. Let the length of 
a predicted binding site be / and the length of the test 
TFBS in question be l s . The predicted binding site is con- 
sidered a hit to the test TFBS if the overlap between the 
two sequences is more than L/ S /2J bases as in [17]. In case 
this is not possible, i.e., / < [l s /2], the predicted binding 
site must be embedded in the true one to be deemed a hit. 

Using the n ranks of TFBSs from v-fold CV, we compute 
recall (true positive rate), precision and the Fp -measure, 
where = 0.5 as in [17]. Let the recall rate be r. The num- 
ber of TFBSs recalled by the method is pi = n x r. Let 
the number of non-binding sites or false positives intro- 
duced be p-f. The precision is given by p ^ pf > while Fp = 

il _L #2\ precisionx recall 
^ " ' ft 2 xprecision+recall ' 

Comparison with an alignment-free method 

We conducted 10-fold CV on the aforementioned 90 
TFs. The PSSM method dependent on LASAGNA 
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Table 4 Species-wise and overall comparisons between LASAGNA and MEME 


Species 


# better 1 


# ties 2 


#TFs 3 


p-value 4 


H. sapiens 


41 (60.3%) 


0 


68 


7.83 x 10~ 3 


M. musculus 


41 (77.4%) 


0 


53 


8.79 x 10~ 6 


D. melanogaster 


26 (89.7%) 


0 


29 


1.02 x 10~ 7 


S. cerevisiae 


10(76.9%) 


3 


13 


2.96 x 10~ 3 


R. norvegicus 


23 (88.5%) 


1 


26 


1.73 x 10~ 4 


Overall 


141 (74.6%) 


4 


189 


3.55 x 10~ 15 



1 . Number of TFs on which LASAGNA performs better than MEME. 

2. Number of TFs on which LASAGNA and MEME have the same performance. 

3. Total number of TFs for a species. 

4. Wilcoxon signed-rank test p-value. 



(LASAGNA-PSSM) was compared to SiTaR [17]. 
LASAGNA considered both strands of a sequence when 
aligning binding sites. The parameters 7<C a = /<s were 
determined in the same way as in comparing LASAGNA 
to ClustalW2. An alignment was trimmed with C m i n = 0.4 
and /Cmin = 0 before constructing a PSSM as described 
in the method section on the LASAGNA algorithm. The 
PSSM method uses a cutoff score to predict TFBSs. The 
cutoff score is set to the minimal score of the constitut- 
ing binding sites of the PSSM. The SiTaR method has a 
mismatch parameter and the maximal value allowed by 
its webtool is 5. We searched in the range from 0 to 5 to 
find the mismatch value giving the highest Fp -measure 
for each TF. 

In terms of the F^, no significant difference was found 
between the two methods (p-value: 0.392 [43]). To ensure 
a fair comparison, we fixed the recall rate for each TF 
and compare the precision achieved by LASAGNA-PSSM 
and SiTaR. The recall rate was set to the lower of the 
recall rates attained by LASAGNA-PSSM and SiTaR. The 
TF c-Jun (AC: T00132) was excluded from comparison 
because SiTaR did not recover any TFBS. Figure 3a shows 
the plot of precision by LASAGNA-PSSM against that 
by SiTaR. At fixed recall rates, LASAGNA-PSSM is more 
precise than SiTaR on 65 out of 89 TFs (p-value: 2.66 x 
10 -8 ). Figure 3b shows the plots of precision against recall 
based on all the recalled TFBSs by each method. It can 
be seen that LASAGNA-PSSM is constantly more precise 

Table 5 Comparison of two groups of TFs divided 



according to results on LASAGNA and MEME 



Factor 


Group 1 1 mean 


Group 2 2 mean 


p-value 3 


# TFBSs 4 


23.33333 


30.85417 


0.03196 


Mean of TFBS length 


18.33468 


19.04125 


0.3007 


SD of TFBS length 5 


7.95844 


7.730625 


0.1846 



1 . LASAGNA performed better than MEME on TFs in this group. 

2. MEME performed better than or equal to LASAGNA on TFs in this group. 

3. Wilcoxon signed-rank test p-value. 

4. Number of binding sites for each TF. 

5. Standard deviation of binding site length for each TF. 



than SiTaR at the same recall rate. Moreover, LASAGNA- 
PSSM recovered substantially more TFBSs than SiTaR at 
the same precision. 

Results reported in [17] showed that SiTaR is highly pre- 
cise and sensitive. Although SiTaR accepts variable-length 
binding sites, all the experiments presented in [17] used 
fixed-length binding sites as inputs. It is therefore not 
clear how SiTaR performs on TFs having variable-length 
binding sites. It is also not clear whether SiTaR prepro- 
cesses highly variable-length binding sites as this was not 
stated in [17]. These issues however are not the focus of 
our work. 

Application of LASAGNA-ChIP to ChlP-seq data 

To demonstrate the use of LASAGNA-ChIP on ChlP- 
seq data, we retrieved mouse ChlP-seq data produced by 
the Encyclopedia of DNA Elements (ENCODE) project 
[44] from the UCSC Genome Browser [39]. All the 38 
peak files in the Narrow Peaks format that matches 
pattern ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/ 
database/wgEncode*Tfbs*Plc were downloaded on Oct. 
12, 2012, where "*" is the wildcard character matching 
zero or more characters. These files give signal peak loca- 
tion besides start and end for each peak and hence the 
corresponding signal files do not need to be processed 

Table 6 Distribution of the 1751 binding sites of 90 TFs in 
TRANSFAC public database 



Species # TFBSs 1 



Homo sapiens 


735 


Mus musculus 


346 


Rattus norvegicus 


278 


Saccharomyces cerevisiae 


158 


Drosophila melanogaster 


155 


Gallus gallus 


73 


Bos taurus 


5 


Sus scrofa 


1 



1. Total number of TFBSs. 
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Figure 3 Comparison of the PSSM method dependent on LASAGNA to SiTaR. (a) Scatter plot of precision by LASAGNA-PSSM against precision 
by SiTaR at the same recall rate for each TF. Each point corresponds to aTF. Seventy-three percent (65 out of 89) of theTFs are above the reference 
line, indicating that LASAGNA-PSSM is more precise for the 65 TFs. (b) Plots of precision against recall for LASAGNA-PSSM and SiTaR based on all the 
90TFs. 



by a peak-finding algorithm. Four distinct cell types 
and 17 distinct target TFs are present in the 38 ChlP- 
seq experiments. Additional file 3 lists, for each ChlP- 
seq experiment, the cell, target TF, number of peaks 
as well as the minimum, maximum, mean and stan- 
dard deviation of peak length. We observe that the peak 
length varies greatly. The mean peak length can be as 
long as 1124, while the highest standard deviation is 
nearly 876. 

It is useful to know if LASAGNA-ChIP is able to align 
peak sequences and reveal the motif of the ChlPed TF. To 
align peak sequences, parameter 7<" a was searched from 0 
to 8 to obtain the alignment with the highest IC. MEME 
was also used to align peak sequences because it is often 
the choice of method. In fact, MEME is used by 5 out 
of 6 tools compared in [45] for ChlP-seq data analysis. 
The MEME parameters are described in section Compari- 
son of alignment algorithms, where the one-per-sequence 
model is assumed. To ensure that both methods finish 
within reasonable time, for each experiment, we randomly 
sampled 300 peaks for alignment. We did not distinguish 
large peaks from small ones because ChlP-seq experi- 
ments require large numbers of cells and hence "a small 
peak could represent very strong binding in only a subset 
of the cells" [46]. 

For each alignment, we searched for the resulting motif 
in 386 UniPROBE mouse motifs and 398 motifs derived 
from all the matrices in the TRANSFAC Public database. 
The search was accomplished by software TOMTOM 
[47]. We used Pearson correlation as the distance mea- 
sure, required a minimal overlap of 5 nucleotides, and set 
the E-value cut-off to 5. Additional file 4 shows, for each 
ChlP-seq experiment, the sequence logos of motifs found 
by LASAGNA-ChIP and MEME. The matching motifs 



found by TOMTOM are listed under each sequence logo 
[48] by E-value. In case more than 10 significant motifs 
were found, only the 10 most significant ones were shown. 
The one matching the ChlPed TF is highlighted in yellow 
for each ChlP-seq experiment. 

We first notice that overall the motifs found by 
LASAGNA-ChIP and MEME are very similar by visual 
inspection. No significant difference is observed in terms 
of motif IC (p-value: 0.1252). For both LASAGNA-ChIP 
and MEME, the ChlPed TF motifs were found for 31 
experiments. Among the other 7 experiments are one 
MYB in MEL cells, all the ETS1 in CH12 and MEL cells, 
one JUND in MEL cells, one MAX in C2C12 cells, all the 
TBP in CH12 and MEL cells. Interestingly, LASAGNA- 
ChIP and MEME differ only for 4 out of these 7 exper- 
iments. They are one ETS1 in CH12 cells, one MAX 
in C2C12 cells and two TBP in CH12 and MEL cells. 
Although LASAGNA-ChIP and MEME differ in these 
cases, the found motifs still warrant further analyses. 
For instance, the motif for ETS1 in CH12 cells found 
by LASAGNA-ChIP resembles the secondary motif of 
Gabpa, which is a known paralog. 

For the other 3 out of the 7 experiments, LASAGNA- 
ChIP and MEME produced similar motifs. The one found 
for MYB in MEL resembles those of GATA proteins. 
This agrees with a recent study reporting that MYB and 
GATA-3 cooperatively regulate IL-13 by direct binding to 
a conserved GATA-3 response element [49]. Since this 
motif is based on 300 peak sequences, it is likely that the 
two proteins similarly regulate other genes in MEL cells. 
The motif for ETS1 in MEL cells also matches those of 
GATA proteins. Cooperation between ETS1 and GATA-3 
in regulating IL-5 was also suggested [50,51]. Finally, while 
the motif for JUND in MEL cells matches two motifs in 
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-950 
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Transcripts 
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Symbol 


mRNA Range 


Homo 
sapiens 


chr17 


950 to 

+ 32582296 CCL2 NM 002982 

+50 



Hits 



gil224589808:32581 346-32582345 Homo sapiens chromosome 17, 
GRCh37.p5 Primary Assembly 


Name 


Sequence 


Position 
(0-based) 


Strand 


Score f 

179.53 


value 


AP-1 
(T00029) 


ACTGCTGAGACCAAAT 


749 




0 


MEF-2A 
(T0 1005) 


CCTTAAAAATAACCC 


674 


+ 


78.11 


7.5E-5 


Sp1 

(T00759) 


AGAGAGGGCGGAGTCAA 


893 


275.28 0 


000125 


P50 

(T00593) 


GGGGAATTTAC 


280 


+ 


73.09 


0.0002 


SRF 

(T00764) 


GCTCTCAAGCTTTGGCCGGCTC 


525 




190.72 0 


000275 


NF-1 

(T00539) 


GCTGGCAGCGAGCCTG 


591 


+ 


01.77 


0.0003 


c-Fos 
(TO0 123) 


CAGTGTCATACT 


346 




62.82 0 


.000325 


NF-IL6-2 
(T00581) 


AGATAAAGTTGGGGAATTTACA 


270 


+ 


82.55 0 


000325 


POU2F2 
(Oct-2.1) 
(T00646) 


TGCATTTGAATGAGCA 


486 


+ 


74.74 


0.0004 



Figure 4 Partial results of scanning the promoter of human gene CCL2.The list of predicted binding sites are sorted byp-value in ascending 
order while only the top-4 hits are shown. The best hit is visualized in the context of other binding sites over a stretch of the promoter, where the 
height of a box is — log 10 p-value. CCL2 is known to be a target gene of AP-1, Sp1 and p50 [28], These 3 binding sites are not in theTRANSFAC 
public database and were not used to build the PSSMs. 
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the TRANSFAC and UniPROBE databases, the matches 
are likely false positives since no literature support was 
found. 

While it is not specifically designed to be a de novo 
motif discovery method, LASAGNA-ChIP aligns all the 
peak sequences and finds the most informative motif. The 
assumption that a motif instance is present in each peak 
sequence may not hold for some experiments. Because 
of several possible binding models [46], two or more 
motifs may be present in subsets of the peak sequences. 
Discovery of more than one motif will be enabled for 
LASAGNA-ChIP in the near future. 

LASAGNA is simple and effective 

Unlike MEME and similar methods, the order in which 
the input sequences are aligned is crucial to LASAGNA 
and ClustalW. ClustalW relies on a guide tree based on 
pairwise alignments to decide the order. LASAGNA, on 
the other hand, depends on the length of a sequence 
and its similarity to the partial alignment. LASAGNA- 
ChIP is well-suited for a TF whose shortest site misses 
the core or contains only a fraction of it. We, however, 
observed no significant difference between LASAGNA 
and LASAGNA-ChIP on TFBSs in the TRANSFAC Public 
database. This is because, for these TFBSs, a shortest site 
often fully contains the core. Hence, our assumption holds 
true in general. 

For ChlP-seq data, the assumption that short sequences 
contain less irrelevant bases flanking the core may not 
hold. However, we observe that, under the one-per- 
sequence model, LASAGNA-ChIP performed compara- 
bly well to MEME in aligning ChlP-seq peak sequences. 
We attempted other orders such as from the longest 
sequence to the shortest one and found that aligning 
the shortest sequence first does have its advantage (data 
not shown). Also, we note that, for 11 out of 38 exper- 
iments, the peak sequences are all at least 100 bp (see 
Additional file 3) and hence all the peak sequences are 
100 bp long after clipping. This implies that LASAGNA- 
ChIP is capable of handling sequences of the same 
length. 

LASAGNA-ChIP, MEME and methods alike produce 
gapless alignments and do have their limits. When a TF 
binds to two cores separated by a variable-length spacer, 
these methods are expected to align the canonical TFBSs 
containing spacers of the most prevalent length. These 
binding patterns are also known as two-block motifs. 
Gapped alignment or explicit modeling [52] is needed to 
correctly align TFBSs of this nature. 

Implementation 

We have implemented a user-friendly webtool named 
LASAGNA-Search, which is freely available at http:// 
biogrid.engr.uconn.edu/lasagna_search/. Useful features 



include automatic promoter retrieval, visualization of hits 
locally and at the UCSC Genome Browser, and automatic 
gene regulatory network construction based on significant 
hits. LASAGNA-Search adopts the LASAGNA-PSSM 
method and currently stores PSSM/q models (PSSM mod- 
els for short), where K s is determined by CV experiments, 
for the 189 TRANSFAC TFs summarized in Table 1 as well 
as 133 TFs from ORegAnno (08NovlO dump). In Addi- 
tional file 5, we list each model with its counterpart for 
the same TF if one is found in matrices in TRANSFAC 
Public. We do not evaluate models by IC because higher 
IC implies higher specificity but not necessarily higher 
sensitivity. Comparison with models in other databases is 
beyond the scope of this study but will be investigated in 
the near future. 

LASAGNA-Search estimates jj-values of PSSM scores 
empirically because the PSSM model for a TF may score 
nucleotide pairs in addition to individual nucleotides. 
When iC s = 0, a PSSM score is considered the sum of 
independent variables and hence the exact p-vahie can be 
efficiently computed [53]. Even with this independence 
assumption, the scores of nucleotide pairs at (1, 2) and (2, 
3), for instance, are never independent. Hence, a PSSM 
score cannot be seen as the sum of independent variables 
when nucleotide pairs are scored. The empirical PSSM 
score distribution of a TF is obtained from scanning a ran- 
dom sequence simulated by f(u), u € {A, C, G, T}, where 
f(u) is estimated from all the TFBSs used to build the 
PSSM. LASAGNA-Search focuses on only PSSM scores 
in the upper 5% and hence scores in the lower 95% are 
given a /?-value of 0.05+. Currently, the smallest nonzero 
/?-value is 2.5 x 10 -5 and 0 means any number less than 
2.5 x 10" 5 . 

As a case study, we scanned the promoter region of 
human gene CCL2 (NCBI Gene ID 6347), also known as 
MCP1. CCL2 was arbitrarily chosen by browsing the ORe- 
gAnno database [28]. The promoter sequence (-950 to +50 
relative to the transcription start site) was automatically 
retrieved and scanned for binding sites of all 68 human 
TFs with the j?-value threshold set to 0.001. Figure 4 
displays a partial view of the search results ordered by p- 
value. The only 3 true positive hits, AP-1, Spl and p50 
(NFKB1), were found in the top-4 of the list. Accord- 
ing to ORegAnno, CCL2 is a target gene of AP-1, Spl, 
NFKB1, STAT1 and GAS, where GAS likely refers to the 
gamma activated site bound by STAT1. STAT1, however, 
is not one of the 68 TFs and hence all the TFs known 
to regulate CCL2 were recalled. The fact that AP-1, Spl 
and p50 regulate CCL2 is also documented in TRANS- 
FAC [27] (T00029, T00759 and T00593). The actual sites 
(R14639, R14638 and R14640), however, are not in the 
public release and were not used to build the PSSM mod- 
els. We note that this case study is for illustration not 
evaluation purposes. 
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Conclusions 

We proposed LASAGNA, a novel alignment algorithm 
specifically designed for aligning variable-length tran- 
scription factor binding sites. Cross-validation results 
on 189 TFs and 4771 TFBSs indicated that LASAGNA 
significantly outperformed ClustalW2 (p-value: 1.22 x 
10" 15 ) and MEME Q?-value: 3.55 x 10" 15 ). This is 
because LASAGNA was specifically designed for align- 
ing variable-length TFBSs. Based on the success of 
LASAGNA, we developed LASAGNA-ChIP, which is 
capable of handling sequences produced by ChlP-chip 
and ChlP-seq experiments. While ClustalW2 is bet- 
ter suited for producing structurally correct alignments, 
LASAGNA-ChIP, MEME and methods alike can be used 
to align sequences produced by ChlP-chip or ChlP-seq 
experiments. 

We compared LASAGNA-PSSM, the PSSM method 
dependent on LASAGNA, to SiTaR, an alignment free 
TFBS search method. Cross-validation experiments were 
conducted on 1751 TFBSs of 90 TFs for both methods. 
The results showed that, at fixed recall rates, LASAGNA- 
PSSM is significantly more precise than SiTaR (p-value: 
2.66 x 10 -8 ). The recall-precision curve showed that our 
method is constantly more precise at any recall rate or 
more sensitive at any precision. 

We conclude that the LASAGNA algorithm is sim- 
ple and effective in aligning variable-length binding sites. 
It has been integrated into a user-friendly webtool for 
TFBS search called LASAGNA-Search. The tool cur- 
rently stores precomputed PSSM models for 189 TFs 
and 133 TFs built from TFBSs in the TRANS FAC Pub- 
lic database (release 7.0) and the ORegAnno database 
(08NovlO dump), respectively. In the future, more sources 
of experimentally validated TFBSs such as the PAZAR 
database will be incorporated into the webtool, making 
variable-length TFBSs more accessible to scientists in the 
field. 
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Additional file 1: LASAGNA-ChIP flowchart. 

Additional file 2: Distribution of A' s by species and conserved domain. 

Additional file 3: Summary of 38 mouse ChlP-seq experiments. Each 
row shows the track name in the UCSC Genome Browser, cell type, target 
TF, number of peak sequences as well as the minimum, maximum, mean 
and standard deviation of peak sequence length. 

Additional file 4: Motifs found by LASAGNA-ChIP and MEME. For each 
ChlP-seq experiment, the sequence logos of motifs found by 
LASAGNA-ChIP and MEME are shown. The matching motifs in the 
TRANSFAC Public and UniPROBE databases found by TOMTOM are listed 
below each sequence logo. The first ChlPed motif TF is highlighted in 
yellow if it is among the matching motifs. When the found motif does not 
resemble those of the ChlPed TF, the first cofactor of the ChlPed TF is 
highlighted in blue if it is among the matching motifs. Other possibly 
correct matches are highlighted in green. 



Additional file 5: List of LASAGNA-built models based on 
TRANSFAC/ORegAnno TFBSs. Only models whose counterparts can be 
found in matrices in TRANSFAC Public are listed. The IC and number of sites 
are shown for each model. 
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